Pipelined systolic array for matrix-matrix multiplication

ABSTRACT

A digital data processor for matrix/matrix multiplication includes a systolic array of nearest neighbor connected gated full adders. The adders are arranged to multiply two input data bits and to add their product to an input cumulative sum bit and a carry bit from a lower order bit computation. The result and input data bits are output to respective neighboring cells, a new carry bit being recirculated for later addition to a higher order bit computation. Column elements of one matrix and row elements of the other are input to either side of the array bit-serially, least significant bit leading, for mutual counterpropagation therethrough with a cumulative time delay between input of adjacent columns or rows. Bit-level matrix interactions for product matrix computation occur at individual cells. Pairs of intercalated adder trees are connected switchably to the array to accumulate bit-level contributions to product matrix elements.

This invention relates to a digital data processor for matrix-matrix multiplication, and more particularly to such a processor incorporating a pipelined systolic array.

Systolic arrays of processing cells are known as a means of executing computations, as set out for example by Kung and Gentleman in "Matrix Triangularisation by Systolic Arrays", SPIE Vol 298, Real Time Signal Processing 4 (1981) pp 19-26. Such arrays are controlled solely by external clocking means which "pump" array operation in the manner of a heart beating, hence the term systolic. Data processing carried out by the array depends on individual cell functions and intercell connections, rather than on a stored computer programme.

Systolic arrays are highly suited to pipelined operations. A sequence of operations is said to be pipelined if a data element can enter the sequence before a preceding element has left it. Pipelining allows reduction in the number of devices which are temporarily idle.

Pipelined systolic arrays employing complex individual processing cells are known. The Kung and Gentleman array (ibid) for example would require individual word-level cells such as microprocessors. More recently, U.S. Pat. No. 4,639,857 to McCanny et al describes arrays in which individual cells operate on single data bits, providing bit-level systolic arrays. Inter alia, that application describes bit-level arrays for multiplication of (1) two numbers, (2) two vectors and (3) a matrix and a vector. Bit-level arrays are highly suitable for implementation as very large scale integrated (VLSI) circuits, unlike arrays of more complex processing cells.

It is an object of the invention to provide a means for multiplying together two matrices.

The present invention provides a digital data processor for multiplying row elements of a first input matrix by column elements of a second input matrix to form a product matrix, the processor including:

a systolic array of processing cells arranged for bit-level multiplication;

data input means arranged to effect bit-level matrix element input to the array for multiplication of each bit of each first matrix row element by each bit of each of the second matrix elements in a respective column;

accumulating means arranged to sum array output contributions to each bit of each product matrix element; and

clocking means to control operation of the processing cells, data input means and accumulating means.

The invention provides a means for matrix/matrix multiplication which is particularly convenient to implement by VLSI techniques. The data input means may be arranged for matrix element input bit-serially with least significant bits leading.

The processor of the invention may incorporate processing cells which are nearest-neighbour connected and have processing functions to evaluate the product of two input data bits, add the product to an input cumulative sum bit and a carry bit from a next lower order bit computation, output a corresponding result, generate a new carry bit and pass on the input data bits to respective neighbouring cells.

The data input means may be arranged to input the matrices with zeros interspersed between adjacent bits of each matrix element. The clocking means is then arranged to advance each input matrix one processing cell through the array after each processing cell operation, and a cumulative time stagger is applied between adjacent rows or columns as appropriate of the input matrices.

The data input means may alternatively be arranged to input matrix element bits without interspersed zeros. The clocking means is then arranged to advance both adjacent columns of one matrix and adjacent rows of the other on alternate processing cycles such that each advanced column or row interacts with a respective stationary row or column. Alternate columns or rows of the matrices as appropriate are input with a cumulative time stagger therebetween.

The accumulating means preferably includes pairs of mutually intercalated adder trees for summing bit-level product matrix elements, bit-level array output contributions being switchably connected to one or other of two respective adders in different trees. The accumulating means may also include means for switching cell output between adder trees in synchronism with the entry of a leading bit of an input matrix element to that cell. The switching means may include two control lines each containing a respective series of latches with one respective latch for each output cell, means for generating pulses to propagate along the control lines in synchronism with leading matrix element bits, and a respective switching circuit for switching each output cell between adder trees when either but not both respective latches receive a pulse.

In order that the invention might be more fully understood, embodiments thereof will now be described with reference to the accompanying drawings, in which:

FIG. 1 is a schematic drawing of a word-level systolic array for multiplying two matrices;

FIGS. 2 to 4 are schematic drawings of a part of a bit-level systolic array for a digital data processor of the invention, and illustrate three successive processing cycles;

FIGS. 5 and 6 are schematic drawings of single and multiple data interaction regions;

FIG. 7 is a schematic drawing of an array for a processor of the invention;

FIGS. 8 and 9 illustrate processing cell functions;

FIG. 10 is a schematic drawing of product matrix accumulating means;

FIG. 11 is a drawing of an adder cell for the accumulating means of FIG. 10;

FIG. 12 is a drawing of a circuit for switching between adder trees in the accumulating means of FIG. 10;

FIG. 13 is a drawing of two product matrix interaction regions; and

FIG. 14 is a drawing of a contracted product matrix interaction region within an array.

Referring to FIG. 1, a 4×7 systolic array indicated generally by 10 comprises individual word-level processing cells indicated by squares 11_(ij) arranged in rows 12 and columns 13. The suffixes i and j indicate row and column positions. Cell interconnections are not shown. Two 4×4 input matrices A and B having respective coefficients a_(ik) and b_(kj) mutually counter-propagate across the array 10 in the direction of arrows 14_(A) and 14_(B). Each column of A (a_(1m) to a_(4m)) and each corresponding row of B (b_(m1) to b_(m4)) move in the same respective row 12 of cells 11. Each cell 11 is a word-level multiplier/accumulator arranged to multiply together column elements of A and row elements of B to form elements c_(ij) of a product matrix C given by

    c_{ij} = \sum_{k=1}^{4} a_{ik} b_{kj}                    (1)

Word-level elements a_(ik) and b_(kj) are input to each cell from the right and left respectively, together with a cumulative product from the respective cell above. On each processing cycle, each cell forms a partial product a_(ik) b_(kj), adds it to the input cumulative product and outputs the result to the cell below. Elements a_(ik) and b_(kj) are then passed on respectively to left and right hand neighbouring cells. Individual elements of each column of A and corresponding row of B move in a respective row 12 meeting each other at cells 11 such that each column element of A multiplies each element in a corresponding row of B. The sum of product terms provides a respective element of the matrix C as indicated by Equation (1). To ensure that all elements meet correctly for multiplication, columns of A and rows of B are each input with zeros interspersed between elements and with a respective one cell cumulative time stagger between data in successive rows 12. Because of the time stagger, the arrays of data in the matrices A and B can be considered as leftward and rightward leaning parallelograms as indicated at 15_(A) and 15_(B).

The product matrix C can be said to occupy a diamond shaped region 15_(C) which moves as indicated by arrow 14_(C) down into the array 10 as A and B move through one another. It will be appreciated that the regions of the parallelograms 15_(A) and 15_(B) and diamond 15_(C) outside the array 10 have no physical significance, but are an aid in understanding data propagation, interaction and product formation. As shown in FIG. 1, the lowest vertex 16 of the diamond 15_(C) has just entered the uppermost row 12 of the array 10, and encompasses a cell 11₁₄ where matrix elements a₁₁ and b₁₁ have met to form part of a product element c₁₁. The partial contribution to element c₁₁ is output downwards. On the next processing cycle, A and B or parallelograms 15_(A) and 15_(B) will have moved on by one processing cell in the directions of arrows 14_(A) and 14_(B) respectively. The interaction region or diamond 15_(C) accordingly moves down one array row 12, so that vertex 16 accommodates cells 11₂₄, 11₁₃ and 11₁₅. The movement of A and B brings elements b₂₁ and a₁₂ together for multiplication at cell 11₂₄ and addition to the product term calculated by cell 11₁₄. As parallelograms 15_(A) and 15_(B) move through each other, diamond 15_(C) moves down through the array, and element c₁₁ of C collects all four inner product terms of the kind a_(1k) b_(k1) which it includes. The element c₁₁ is then given by the output of cell 11₄₄. Subsequently, elements c₂₂, c₃₃ and c₄₄ emerge from cell 11₄₄. In a similar fashion, other elements c_(ij) (i≠j) appear at the outputs of cells 11 in the fourth row having collected all corresponding inner product terms a_(ik) b_(kj).

Reference is now made to FIGS. 2, 3 and 4, in which like parts have like references. These drawings show a small part 20 of a bit-level systolic array (not shown in full) on three successive array processing cycles, and illustrate bit-level implementation of the FIG. 1 word-level arrangement. The array 20 has bit-level processing cells 21 having suffixes, i.e. 21_(ij), indicating row and column positions respectively. The cells 21 are one-bit gated full adders with nearest neighbour interconnections (not shown) as will be illustrated in detail later. Each cell 21 receives input matrix element data bits from its left and right hand nearest neighbour cells together with a cumulative product bit from its upper nearest neighbour. It multiplies together the matrix element data bits, adds the cumulative product thereto and outputs the result to its lower nearest neighbour. Data bits previously received from the left and right are passed on to neighbouring cells to the right and left respectively. This takes place on each processing cycle under the control of clocking means.

Two 4×4 matrices A and B having three-bit elements a_(ik) and b_(kj) are input to the array 20 with the respective least significant bit (lsb) leading. The most significant bit (msb) may be a sign bit, and bit significance is indicated by superscript parenthesis (n), where n=0, 1 or 2 in ascending order of significance. Individual bits are input to each row of the array 20 interspersed with zeros, and a one cell progressive time delay or stagger is introduced between the bits input to adjoining rows, the stagger being cumulative down the array 20. To multiply two matrices, the rows of one matrix must interact with the columns of the other. Moreover, to multiply two bit-level words, each bit of one word must multiply each bit of the other. To implement this, the elements in the first column of A, i.e. a₁₁^(n) to a₄₁^(n) (n=0, 1 or 2), are input bit serially, lsb leading and sequentially into the uppermost row (cells 21₁₁ to 21₁₇) of the array 20, only a₁₁^(n) being shown. Similarly, elements a₁₂ to a₄₂, a₁₃ to a₄₃ and a₁₄ to a₄₄ in the second, third and fourth columns of A are input to the second, third and lowest rows of the array 20, only elements a₁₂, a₁₃ and a₁₄ being shown. The bit-level elements of A, a_(ik)^(n), move one cell to the left on each processing cycle as indicated by arrow 22.

Similarly, elements in the first row of matrix B, b₁₁^(n) to b₁₄^(n) are input to the uppermost row (cells 21₁₁ to 21₁₇) of the array 20, only b₁₁^(n) being illustrated. Moreover, elements b₂₁ to b₂₄, b₃₁ to b₃₄ and b₄₁ to b₄₄ are input to the other three rows of the array 20, only b₂₁, b₃₁ and b₄₁ being shown. The one-cell progressive time stagger between data input to adjacent rows is retained. The bit-level elements of B, b_(kj)^(n), move one cell to the right each processing cycle as indicated by arrow 23.

By virtue of the time stagger, the A and B matrix element bits illustrated, a_(1k)^(n) and b_(k1)^(n) (n=0 to 2, k=1 to 4), occupy respective leftward and rightward leaning parallelograms 24 and 25 which overlap in part-diamond shaped interaction regions 26₁, 26₂ and 26₃ in FIGS. 2, 3 and 4 respectively.

FIGS. 2 to 4 show bit-level interactions forming c₁₁. More generally, product terms c_(ij) are formed as follows. The rth bit of c_(ij) is given by:

    c_{ij}^{(r)} = \sum_{k=1}^{m} \sum_{n} a_{ik}^{(n)} b_{kj}^{(r-n)} + carries                    (2)

where n=0, 1 or 2 for three bit words and m=4 for 4×4 matrices A and B.
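The bit-level decomposition described above can be illustrated with a short Python sketch. It is not the array itself, only an arithmetic check, assuming unsigned words: every bit of a_(ik) is multiplied by every bit of b_(kj), the one-bit products of equal weight are collected into a column, and the carries are resolved at the end. The function names `bits_lsb_first` and `product_element_from_bits` are hypothetical and chosen for illustration.

```python
def bits_lsb_first(value, width):
    """Unsigned `width`-bit expansion of `value`, least significant bit first."""
    return [(value >> n) & 1 for n in range(width)]


def product_element_from_bits(a_row, b_col, width):
    """Form one product element c_ij from one-bit partial products.

    a_row holds a_i1..a_im and b_col holds b_1j..b_mj as unsigned integers
    (an assumption of this sketch; sign handling is treated separately).
    """
    # column_sums[r] counts the partial products a^(p) b^(q) with p + q == r,
    # i.e. the contributions of weight 2**r before carries are resolved.
    column_sums = [0] * (2 * width - 1)
    for a, b in zip(a_row, b_col):
        a_bits = bits_lsb_first(a, width)
        b_bits = bits_lsb_first(b, width)
        for p, a_bit in enumerate(a_bits):
            for q, b_bit in enumerate(b_bits):
                column_sums[p + q] += a_bit & b_bit
    # Weighting each column by 2**r resolves the carries between columns.
    return sum(count << r for r, count in enumerate(column_sums))


if __name__ == "__main__":
    row_of_a = [5, 3, 7, 1]      # three-bit elements a_11..a_14
    column_of_b = [2, 6, 4, 7]   # three-bit elements b_11..b_41
    print(product_element_from_bits(row_of_a, column_of_b, width=3))
```

The result agrees with the ordinary word-level inner product of the same row and column, which is the property the bit-level array relies on.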

The time stagger applied across the array 20 to input matrix elements may be achieved by the use of latches (not shown) in series with the rows of cells. In FIG. 2, matrix elements b₄₁ and a₁₄ would be input to cells 21₄₁ and 21₄₇ respectively via a respective delay line of three latches. Similarly, cells 21₃₁ /21₃₇ and 21₂₁ /21₂₇ would be in series with two latches and one latch respectively.

FIGS. 2, 3 and 4 exemplify the double summation of Equation (2). In FIG. 2, in the interaction region 26₁, cells 21₁₃, 21₁₅ and 21₂₄ are evaluating a₁₁^(0) b₁₁^(1), a₁₁^(1) b₁₁^(0) and a₁₂^(0) b₂₁^(0). Moreover, cell 21₂₄ adds a₁₁^(0) b₁₁^(0) (received from vertically above) to its evaluated product. On the next processing cycle shown in FIG. 3, the cells each output their products to the respective cell vertically below, and the A and B parallelograms 24 and 25 move on one cell 21 to the left and to the right respectively. Consequently, the interaction region 26 moves down one cell 21 to become region 26₂. Cells 21₂₃, 21₂₅ and 21₃₄ are now evaluating the products a₁₂^(0) b₂₁^(1), a₁₂^(1) b₂₁^(0) and a₁₃^(0) b₃₁^(0), and each adds its product to the respective previously evaluated product it received from vertically above. As before, each cell 21 outputs its sum of products to the respective cell vertically below. On the next processing cycle shown in FIG. 4, parallelograms 24 and 25 move a further cell 21 to the left and right respectively, and interaction region 26 moves down one cell to become region 26₃. The outputs of cells 21₃₃, 21₃₅ and 21₄₄ are the products a₁₃^(0) b₃₁^(1), a₁₃^(1) b₃₁^(0) and a₁₄^(0) b₄₁^(0) respectively together with the corresponding cumulative product sum received from vertically above in each case. On the next cycle (not shown), the output of cell 21₄₄ passes to accumulating means (not shown) connected to cells 21₄₁ to 21₄₇ in the lowermost row of the array 20 and cells 21₄₃ and 21₄₅ evaluate a₁₄^(0) b₄₁^(1) and a₁₄^(1) b₄₁^(0) respectively. The accumulating means accordingly receives the following outputs:

    Cell 21_{43}: a_{11}^{(0)} b_{11}^{(1)} + a_{12}^{(0)} b_{21}^{(1)} + a_{13}^{(0)} b_{31}^{(1)} + a_{14}^{(0)} b_{41}^{(1)}    (3.1)

    Cell 21_{45}: a_{11}^{(1)} b_{11}^{(0)} + a_{12}^{(1)} b_{21}^{(0)} + a_{13}^{(1)} b_{31}^{(0)} + a_{14}^{(1)} b_{41}^{(0)}    (3.2)

    Cell 21_{44}: a_{11}^{(0)} b_{11}^{(0)} + a_{12}^{(0)} b_{21}^{(0)} + a_{13}^{(0)} b_{31}^{(0)} + a_{14}^{(0)} b_{41}^{(0)}    (3.3)

In each multiplication and subsequent summation performed by a cell 21, a carry bit is generated. Each carry bit remains for one cycle on the same respective cell site, being effectively recirculated for addition to the product to be evaluated on the subsequent processing cycle by that cell. The expressions (3.1) to (3.3) are accordingly one bit wide, any carry generated during each summation having been left behind. This is valid because each row of cells 21 in the array evaluates progressively higher order bits on successive processing cycles; i.e. cells 21 evaluating the nth bit c₁₁^(n) will evaluate the (n+1)th bit on the next cycle as the interaction region 26 moves down the array 20. Inspection of FIGS. 2 to 4 shows that each horizontal row of cells within the part diamond shaped interaction regions 26 contains all the bit-level partial product terms appropriate to a respective bit of c₁₁, carry bits from the preceding processing cycle occupying cells 21 containing two zeros.

On the processing cycle following that shown in FIG. 4, as has been said cell 21₄₄ outputs the products in Expression (3.3) to accumulating means (to be described later) to give the lsb of c₁₁, c₁₁^(0). One cycle later, terms forming the second lsb c₁₁^(1) emerge from cells 21₄₃ (Expression 3.1), 21₄₄ (carry bit) and 21₄₅ (Expression 3.2) for accumulation. On the subsequent cycle, the third lsb c₁₁^(2) is summed analogously from cells 21₄₂ to 21₄₆ inclusive. Extending this, it will be appreciated that in general the nth bit c₁₁^(n) will be derived by summing 1, 3, 5, 5, 3, or 1 cell outputs as n goes from 1 to 6, and is output bit serially lsb leading.

In the example discussed with reference to FIGS. 2 to 4, it has been assumed for convenience that two three-bit numbers a_(ik) and b_(kj) will produce c_(ij) no greater than six bits. This assumption reduces the size of the array to a scale which is convenient for illustrating the operation of the invention. However, some values of c_(ij) might have more than six bits, and provision for this is made as follows. Referring to FIG. 5, a generalised array 50 contains counter-propagating input data parallelograms 51 and 52. The parallelograms 51 and 52 accommodate individual words a_(ik) and b_(kj) defining an interaction region 53 containing the product c_(ij) as previously defined by Equation (1). The elements a_(ik) and b_(kj) are of arbitrary but equal word length. Since c_(ij) is the sum of individual products of the form a_(ik) b_(kj), it will have 2m+log₂ n bits; here m is the number of bits in a_(ik) and b_(kj), and n is the matrix size equal to the maximum value of k or the maximum number of word level products a_(ik) b_(kj) to be accumulated. The maximum word length of each value of c_(ij) must be accommodated vertically within the diamond 53. For this purpose, the input word lengths of a_(ik) and b_(kj) are each increased by a number of zeros equal to 1/2log₂ n added to the most significant or rearward end of the relevant word. The added zeros are also interspersed with zeros as with the original bits of a_(ik) and b_(kj). This is effected by data input means (not shown) before each input word enters the array, and is equivalent to spacing the input parallelograms 51 and 52 by "guard bands" 54 and 55 to accommodate word growth, i.e. carry bits of c_(ij) propagating vertically up the interaction region 53. Non-integral values of 1/2log₂ n are rounded up to an integer for both a_(ik) and b_(kj).

Referring now to FIG. 6, there is shown a generalised array 60 containing leftward leaning parallelograms 61_(ik) for elements a_(ik) and rightward leaning parallelograms 62_(kj) for elements b_(kj). FIG. 6 represents two matrices A and B in the process of interaction to form a matrix C. Elements of A and B have respective guard bands 63. Wholly and partially diamond-shaped regions 64_(ij) accommodate product matrix elements c_(ij). At the instant of time represented by FIG. 6, element b_(k1) is in the process of interacting with elements a_(1k), a_(2k) and a_(3k) to form c₁₁, c₂₁ and c₃₁; b_(k2) is forming c₁₂ and c₂₂ with a_(1k) and a_(2k), and b_(k3) is forming c₁₃ with a_(1k). One cycle later, the A and B parallelograms would each have moved one cell onwards through the array and the c_(ij) diamonds one cell downwards. Each element c_(ij) vertically accumulates bit-level inner products as it passes through the array 60, and eventually emerges below the array for accumulation. It can be seen that bit-level interaction regions c_(ij) corresponding to interaction regions 26 in FIGS. 2 to 4 collectively form a word-level interaction region 15_(C) in FIG. 1.

Referring now to FIG. 7, there is schematically shown a complete systolic array 70 of processing cells indicated by squares 71 equivalent to cells 21 in FIGS. 2 to 4, cell interconnections not being shown. The array 70 is designed for the multiplication of two 8×8 matrices A and B having elements a_(ik) and b_(kj) each two bits wide. It is convenient to treat the case of two bit word lengths, because this gives FIG. 7 a comparatively reasonable size consistent with illustrating necessary features. As will be described later, extension to larger word sizes and bigger matrices is conceptually straightforward.

The minimum size of the array 70 arises as follows. Words must be input with a length (m+1/2log₂ n), where m=2 for two bit words and n=8, the size of the matrix. The matrix words a_(ij) and b_(ij) must each be input with a length of 2+1/2log₂ 8=3 1/2, rounded up to 4 bits. This gives c_(ij) equal to a maximum of 4+4 or 8 bits wide. The width of the array 70 is determined by the criterion that it must accommodate at least a full row of one input matrix overlapping a full column of the other. This ensures that all row elements of one matrix meet all column elements of the other within the array 70. Since the matrices have eight rows and columns, and each element has m+1/2log₂ 8 bits (including guard bands), or 4 bits when rounded up, interspersed with zeros, the minimum width of the array 70 is 8×4×2 or 64 cells 71. The height of the array 70 is 8 cells 71, since the array must accommodate the eight columns of matrix A superimposed on the eight rows of matrix B. This gives a minimum array size of 64×8 or 512 cells in this example.
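The sizing rule just described can be written out as a small Python sketch for square matrices. It assumes, as in the text, that 1/2log₂ n guard bits are added to each word and rounded up, and that each bit is interspersed with a zero, which doubles the occupied width. The function names `padded_word_length` and `minimum_array_size` are hypothetical.

```python
import math


def padded_word_length(word_bits, matrix_size):
    """Input word length including guard bits: m + ceil(0.5 * log2(n))."""
    guard_bits = math.ceil(0.5 * math.log2(matrix_size))
    return word_bits + guard_bits


def minimum_array_size(word_bits, matrix_size):
    """Return (width, height) in cells for n x n matrices of m-bit words."""
    # Factor 2: each bit is followed by an interspersed zero.
    width = matrix_size * padded_word_length(word_bits, matrix_size) * 2
    # One array row per column of A (equivalently per row of B).
    height = matrix_size
    return width, height


if __name__ == "__main__":
    width, height = minimum_array_size(word_bits=2, matrix_size=8)
    print(width, height, width * height)
```

For the 8×8, two bit case this reproduces the 64×8 array of 512 cells derived above.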

The array 70 contains eight complete diamond shaped interaction regions 72, together with seven upper and lower part diamond interaction regions 73 and 74 respectively. The regions 73 and 74 terminate above and below at the top and bottom rows 75 and 76 of the array 70. The interaction regions 73 and 74 are formed by the parallelograms 77 and 78. The parallelograms 77 and 78 mutually counter-propagate to the right and left in the array 70 as described with respect to FIGS. 2 to 4, and the regions 72 to 74 accordingly move down the array. It can be seen that the array 70:

(1) is evaluating c₁₈ . . . c_(i(9-i)) . . . c₈₁ in regions 72, where i=1 to 8;

(2) has partly completed evaluating c₁₇ . . . c_(i(8-i)) . . . c₇₁ in regions 74, where i=1 to 7; and

(3) has begun evaluating c₂₈ . . . c_(i(10-i)) . . . c₈₂ in regions 73, where i=2 to 8.

Furthermore, by analogy with the word-level arrays 10 and 60 shown in FIGS. 1 and 6, it will be appreciated that,

(4) the terms c_(ij), where (i+j)=11 to 16,

have yet to be evaluated, and can be considered as being "above" the array 70, the corresponding data parallelograms having yet to cross one another; and

(5) the terms c_(ij), where (i+j)=2 to 7,

have passed through the array 70 for summing by accumulating means (not shown), and can be considered as being "below" the array, the corresponding data parallelograms having crossed.

As has been mentioned regarding the FIG. 1 word-level array 10, interaction regions other than 72 to 74, i.e. lying outside the array 70, do not have physical significance but assist analysis of array operation.

As a further example of array dimensions required for multiplying two matrices, 16×16 matrices of 8 bit words would require an array of 320×16 cells. Arrays could be implemented as individual VLSI chips, or by employing a number of linked chips.

Whereas the operation of the array 70 of the invention has been discussed in relation to the multiplication of n×n or square matrices, it may also be employed to multiply rectangular matrices, such as multiplication of an m×p matrix A by a p×n matrix B. The array 70 would then be p rows high, and would have a minimum length in number of cells equal to the product of the word length of A or B elements with whichever was the larger of m and n.

Referring now to FIG. 8, an individual cell 81 corresponding to cell 21 of the array 20 (FIGS. 2 to 4) and cell 71 of the array 70 (FIG. 7) is shown together with interconnections to other cells (not shown). The cell 81 is appropriate for multiplying together positive bit-level numbers. It is a gated full adder having lateral input lines 82 and 83 for accepting A and B matrix data bits a and b progressing to the left and to the right respectively. The input lines 82 and 83 contain respective data bit latches 84 and 85. The cell has a further vertical input line 86 containing a latch 87 and a carry recirculation line 88 containing a latch 89. Lateral output lines 90 and 91 are provided for passing on a and b data bits to the left and right respectively, and a vertical output line 92 for the cell computation output.

The cell 81 has four nearest neighbour cells (not shown) above, below and to the left and right, as indicated in FIGS. 2 to 4 and 7 (apart from cells on the edges of the arrays 20 and 70), and operates as follows. Input data bits a and b occupying latches 84 and 85 are clocked into the cell 81 on a processing cycle, the data bits having been obtained from lateral nearest neighbour cells to the right and left respectively. The cell 81 also receives a cumulative product bit c' from the latch 87, c' having been computed by the nearest neighbour cell vertically above on the preceding cycle. The cell 81 computes the product of a and b and adds to the result c' plus the carry bit cy' obtained on the preceding cycle. This generates a cumulative output product c for output via line 92 to the nearest neighbour cell vertically below, together with a new carry bit cy. The gated full adder logic functions which achieve this are as follows:

    c=c'⊕(a·b)⊕cy'                            (4.1)

    cy=[(a·b)·c']+[(a·b)·cy']+[c'·cy']                                                       (4.2)

After the above computation, output data bits a and b pass out laterally to the left and right respectively for storage on latches equivalent to 84 and 85 in input lines to lateral nearest neighbours. Similarly, c passes out to a latch equivalent to 87 associated with the cell immediately below. New data bits a, b and c' are then clocked in from latches 84, 85 and 87 and the cycle repeats. Clocking means to achieve this are well known and will not be described.

As regards cells on the edges of the arrays 20 and 70 not wholly surrounded by nearest neighbours, data input means (not shown) supplies data bits from the left and right. Lateral output data lines 90 and 91 on the left and right array edges are unconnected, the output bits "falling out" of the array. The input lines 86 on the upper array edge are initialised to zero, and the output lines 92 on the lower array edge are connected to accumulating means (not shown) to be described later.

Referring now to FIG. 9, there is shown an alternative form of cell 94 appropriate for processing twos complement numbers. The cell 94 is equivalent to cell 81 of FIG. 8 with the addition of a vertical input control line 95 incorporating a latch 96 together with a vertical output control line 97.
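The logic functions (4.1) and (4.2) can be checked with a minimal Python sketch of a single cell computation. It models only the combinational function of the cell, not the latches or clocking, and the function name `gated_full_adder` and its argument names are hypothetical.

```python
from itertools import product


def gated_full_adder(a, b, c_in, carry_in):
    """One cell computation: returns (c, cy) per Equations (4.1) and (4.2).

    a, b      -- matrix data bits arriving laterally
    c_in      -- cumulative product bit c' from the cell above
    carry_in  -- carry bit cy' recirculated from the preceding cycle
    """
    gated = a & b                                  # the one-bit product a.b
    c_out = c_in ^ gated ^ carry_in                # Equation (4.1)
    carry_out = (gated & c_in) | (gated & carry_in) | (c_in & carry_in)  # (4.2)
    return c_out, carry_out


if __name__ == "__main__":
    # The two output bits together encode the arithmetic sum a*b + c' + cy'.
    for a, b, c_in, carry_in in product((0, 1), repeat=4):
        c_out, carry_out = gated_full_adder(a, b, c_in, carry_in)
        assert c_out + 2 * carry_out == a * b + c_in + carry_in
    print("all 16 input combinations consistent")
```

The exhaustive loop confirms that c and cy are the sum and carry of the three one-bit quantities a·b, c' and cy'.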

The cell 94 modifies the array (20 or 70) to handle twos complement words in accordance with the Baugh-Wooley algorithm, IEEE Trans. on Computers, Vol C-22, No 12, Dec 1973 pp 1045-1047. The multiplication of two twos complement words may be transformed into all positive partial products, provided that all negatively weighted partial products (those involving multiplication of a sign bit by a non-sign bit) are complemented and a fixed correction term is added to the final answer. If the numbers to be multiplied are m bits wide, the correction term has the value 2^(m) - 2^(2m-1). A detailed analysis based on FIGS. 2 to 7 would indicate that the partial products to be complemented are those falling on the upper left and right boundaries--but not the apex--of the diamond shaped interaction regions when present in the array. It will be appreciated that these products must be complemented as they move in the array. In FIGS. 2 and 3 the upper left and right boundaries have yet to enter the array 20, and have only begun to enter in FIG. 4. They are however present in regions 72 and 74 in FIG. 7. The partial products to be complemented are identified by means of the control function of the cell 94. An additional control bit is latched at 96 for input to the cell 94, the control bit being set to 1 when the complement of the partial product a·b is to be added to the cumulative input sum c' to form the cell output c. The control bit is latched from cell to cell vertically down the array 20 or 70 in synchronism with the propagation of the edges of interaction regions, and is used to indicate cells at which complementing is required.

The logic function of the cell 94 is as follows, where ctrl indicates the control bit and other terms are as previously defined:

    c=[c'⊕{ctrl⊕(a·b)}⊕cy']               (5.1)

    cy=[{ctrl⊕(a·b)}·c']+[{ctrl⊕(a·b)}·cy']+[c'·cy']                                 (5.2)

As has been mentioned, the final results of the array computations are required to be corrected for the presence of unwanted sign/non-sign cross-product terms. If the resultant value c_(ij) in Equation (2) emergent from the array 20 is the result of n additions of bit-level multiplications, the correction term is n×(2^(m) - 2^(2m-1)) for m-bit words. Correction may be achieved quite simply either by initialising the cumulative product inputs of top row processing cells, or by adding corrections to the outputs of the accumulating means.

One form of handling twos complement numbers has been described in detail in published United Kingdom Patent Application No. GB 2,106,287A, and a second form is expected to be published shortly in the IEEE Trans. Circuits and Systems. In view of this a detailed analysis will not be given here.
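The correction term can be checked numerically for a single word-level product. The following Python sketch is an arithmetic illustration only, not a model of the cell 94 or its control line: it forms every one-bit partial product, complements those pairing a sign bit with a non-sign bit, sums them with their positional weights and adds 2^(m) - 2^(2m-1). The function names `to_signed` and `complemented_product` are hypothetical.

```python
def to_signed(pattern, m):
    """Interpret an m-bit pattern (0 <= pattern < 2**m) as twos complement."""
    return pattern - (1 << m) if pattern & (1 << (m - 1)) else pattern


def complemented_product(a, b, m):
    """Product of two m-bit twos complement patterns using only positive
    partial products, complemented cross terms and a fixed correction."""
    total = 0
    for p in range(m):
        for q in range(m):
            bit = ((a >> p) & 1) & ((b >> q) & 1)
            # Exactly one operand bit is a sign bit: negatively weighted
            # term, so its complement is accumulated instead.
            if (p == m - 1) != (q == m - 1):
                bit ^= 1
            total += bit << (p + q)
    # Fixed correction term for one m-bit by m-bit product.
    return total + (1 << m) - (1 << (2 * m - 1))


if __name__ == "__main__":
    m = 4
    for a in range(1 << m):
        for b in range(1 << m):
            assert complemented_product(a, b, m) == to_signed(a, m) * to_signed(b, m)
    print("correction term verified for all 4-bit operand pairs")
```

When n such products are accumulated into one element c_(ij), each contributes the same fixed term, which is consistent with the factor n in the correction stated above.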

Referring now to FIG. 10, in which parts previously mentioned are likereferenced, there is shown an accumulating means or accumulator 100 forsumming the outputs of the FIG. 7 array 70 to calculate product matrixelements c_(ij) . Eleven cells 71₁ to 71₁₁ of the bottom row 76 of thearray 70 are shown, and the positions of the edges or lateral apices ofinteraction regions of FIG. 7 are indicated by vertical chain lines 101,102 and 103 passing through the middle of every fourth cell 71₂, 71₆ and71₁₀ respectively. As will be described, interaction regions encompassup to seven cells lying wholly between pairs of alternate vertical linessuch as 101 and 103, but not the adjacent cells on those lines.

Below the row of cells 71 are located mutually intercalated full addertrees of which three are indicated by the character O, X or Y withineach adder 108. Each tree comprises adders 108 arranged in upper, middleand lower ranks indicated by chain lines 109 to 111. Each upper rankadder 108 is arranged to sum the outputs of two cells 71, and eachmiddle or lower rank adder sums the outputs of two upper or middle rankadders respectively.

Referring now also to FIG. 11, in which like parts are like referenced, each adder 108 is a full adder as illustrated. It has two input lines 115 incorporating respective latches 116, a carry recirculation line 117 including a latch 118 and an output line 119. Operation of the adder 108 and latches 116 and 118 is under the control of clocking means controlling the array of processing cells 71. The adder 108 receives two input data bits p' and q' from two cells 71 or two adders 108 (not shown) immediately above, adds the data bits to a carry bit cy' from an earlier computation, and produces a sum bit s and a new carry bit cy. The sum bit s is passed on to the adder or output immediately below (not shown). The full adder logic function is given by:

    s←p'⊕q'⊕cy'                                   (6.1)

    cy←(p'·q')+(p'·cy')+(q'·cy').                 (6.2)
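The full adder function of Equations (6.1) and (6.2) may be sketched in software as follows. This is an illustrative Python model only, not a description of the circuit of FIG. 11: the "+" of Equation (6.2) is read as logical OR, the carry latch 118 is modelled as an attribute initialised to zero, and the names SerialFullAdder, to_bits_lsb_first, from_bits_lsb_first and serial_add are hypothetical.

```python
class SerialFullAdder:
    """Bit-serial full adder with a recirculated carry (Equations 6.1 and 6.2)."""

    def __init__(self):
        self.cy = 0  # recirculated carry, assumed zero before the lsb arrives

    def step(self, p, q):
        """Consume one bit from each input stream and return the sum bit s."""
        s = p ^ q ^ self.cy                                  # Equation (6.1)
        self.cy = (p & q) | (p & self.cy) | (q & self.cy)    # Equation (6.2)
        return s


def to_bits_lsb_first(value, width):
    """Split a non-negative integer into `width` bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    """Reassemble an integer from bits given least significant first."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(x, y, width):
    """Add x and y bit-serially, lsb leading, over `width` clock cycles."""
    adder = SerialFullAdder()
    pairs = zip(to_bits_lsb_first(x, width), to_bits_lsb_first(y, width))
    return from_bits_lsb_first([adder.step(p, q) for p, q in pairs])


print(serial_add(5, 6, 4))  # 11
```

The width must allow for word growth; a final carry left in the latch after the last cycle would otherwise be lost.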

The output of each of the cells 71₃ to 71₆ is connected either to adder tree O or to adder tree X in accordance with the position of a respective cell output switch 112. Similarly, cells 71₇ to 71₁₀ may be connected to adder tree O or Y in accordance with the positions of further switches 112.

The arrangement of FIGS. 10 and 11 operates as follows, reference to FIG. 7 being made also. Consider the diamond shaped interaction regions 72 moving down the array 70. As illustrated, each region 72 encompasses one cell 71 in the bottom row 76 having its output summed by the accumulator 100. On the next seven cycles, it will encompass three, five, seven, seven, five, three and one bottom row cells 71 in succession before "passing out" of the array 70 (cf movement of regions 26 in FIGS. 2 and 4). The partial interaction regions 74 each encompass (as illustrated) seven bottom row cells 71 having summed outputs, and will pass out of the array 70 after three further cycles when each region 72 will encompass seven bottom row cells and each region 73 one. Inspection of FIGS. 7 and 10 shows that the output of each bottom row cell 71 must switch from one adder tree (O, X or Y) to another when the upper or lower diagonal boundary of an interaction region such as 72 to 74 has passed over it.

The separation between adjacent pairs of vertical chain lines (e.g. 101-102 or 102-103) corresponds to half-widths of diamond interaction regions 72 to 74. The switching of each cell 71 between two respective adder trees varies in accordance with the respective cell position along the bottom row 76. Cell 71₃ for example occupies the left hand side of an interaction region for two cycles and then switches to occupy the right hand side of a successive interaction region for six cycles. The corresponding numbers of cycles for cells 71₄ and 71₅ are four and four, and six and two respectively. Cell 71₆ however remains in each successive interaction region for eight cycles, since it occupies the middle of each region. The interaction region occupation scheme of cells 71₃ to 71₆ applies to each subset of four cells, i.e. 71₇ to 71₁₀ and so on.
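The occupation scheme just described can be tabulated by a short Python sketch. The closed-form rule used here (twice the cell's offset from the preceding chain line) is an inference from the cycle counts quoted above for cells 71₃ to 71₆ and is not stated in the specification; the names PERIOD, GROUP and occupancy are hypothetical.

```python
PERIOD = 8  # cycles taken by one interaction region to pass a bottom row cell
GROUP = 4   # cell spacing between adjacent vertical chain lines


def occupancy(cell):
    """Return (left_cycles, right_cycles) for bottom row cell 71_cell.

    A cell lying on a chain line through a region's middle (71_6, 71_10, ...)
    never changes side and is reported as (PERIOD, 0).
    """
    offset = (cell - 2) % GROUP  # 71_3 -> 1, 71_4 -> 2, 71_5 -> 3, 71_6 -> 0
    if offset == 0:
        return (PERIOD, 0)
    left = 2 * offset            # assumed rule fitted to the quoted counts
    return (left, PERIOD - left)


for cell in range(3, 11):
    print(cell, occupancy(cell))
```

Under this reading the two counts for any switching cell always total eight cycles, which agrees with the eight-cycle repetition of the switching sequence.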

When located in the left hand half of any interaction region, any of cells 71₃ to 71₅ would be switched to adder tree O. However, any one of cells 71₇ to 71₉ would be switched to adder tree O when located in the right hand half of any interaction region. When changing from occupying one half of an interaction region to the opposite half of a succeeding region, each of the cells 71₃ to 71₉ would switch from one respective adder tree to the other. The fourth cells 71₆ and 71₁₀ remain switched to adder trees O and Y respectively throughout. Pairs of cells 71_(12-n) and 71_(n) (n=3, 4 or 5) switch in synchronism.

Cells 71₃ to 71₉ represent a section of the bottom row 76 of the array 70 vertically below the seven-cell maximum width of any one of the interaction regions 72. On the processing cycle illustrated in FIG. 7, adder tree O would be summing the output of cell 71₆, cells 71₃ to 71₅ would be switched to adder tree X and cells 71₇ to 71₉ to adder tree Y with cell 71₁₀. One cycle later, the region 74 would move down to encompass the three cells 71₅ to 71₇ which would accordingly be connected to adder tree O. Adder trees X and Y would each be connected to five respective cells to the left and right of cells 71₅ to 71₇. Table 1 shows the cells 71 switched to adder tree O for eight successive cycles beginning with cycle 1 shown in FIG. 7.

                  TABLE 1
______________________________________________
Cycle      Cells Connected      No of Outputs
Number     to Adder Tree O      Summed
______________________________________________
1          71₆                  1
2          71₅ to 71₇           3
3          71₄ to 71₈           5
4          71₃ to 71₉           7
5          71₃ to 71₉           7
6          71₄ to 71₈           5
7          71₅ to 71₇           3
8          71₆                  1
______________________________________________

On the subsequent cycle, or cycle 9, the Table 1 sequence begins to repeat for the next interaction region (not shown) vertically above region 72.
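The Table 1 sequence, including its repetition from cycle 9, can be reproduced by the following Python sketch. It assumes only the figures given in Table 1 (a region centred on cell 71₆ and an eight-cycle period); the function name tree_o_cells and its parameters are hypothetical.

```python
def tree_o_cells(cycle, centre=6, period=8):
    """Return the bottom row cell indices connected to adder tree O on `cycle`."""
    t = (cycle - 1) % period            # position within the region's passage
    half = min(t, period - 1 - t)       # half-widths 0, 1, 2, 3, 3, 2, 1, 0
    return list(range(centre - half, centre + half + 1))


for cycle in range(1, 10):
    cells = tree_o_cells(cycle)
    print(cycle, cells[0], "to", cells[-1], len(cells))
```

The printed counts follow the diamond profile of the interaction region as it crosses the bottom row.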

The cell outputs summed by adder tree O on each cycle are those required for each successive bit of the product matrix element c_(ij) corresponding to interaction region 72. The bits emerge serially lsb leading from the lower rank adder 108 of adder tree O three processing cycles after leaving the bottom row 76 of cells 71. Three cycles after cycle 8 of Table 1, the eighth bit of c_(ij) emerges from tree O, and consists only of the carry bit recirculated on the respective lower rank adder from cycle 7. This is because the product evaluated by cell 71₆ on cycle 8 is that of two word-growth zeros. After the final carry bit has emerged, the lower rank adder of tree O has been zeroed for computation of the lsb of the next interaction region (not shown) vertically above region 72, and the Table 1 sequence then repeats.
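The manner in which a three-rank tree of bit-serial full adders accumulates cell outputs, lsb leading, may be illustrated by the Python sketch below. It is a simplified model under stated assumptions: each tree is taken to have eight inputs (unused inputs held at zero), the input latches of each rank are omitted so that the three-cycle latency described above is not modelled, and the names serial_adder, tree_sum and to_stream are hypothetical.

```python
def serial_adder():
    """Return a step function for one full adder with a recirculated carry."""
    carry = 0

    def step(p, q):
        nonlocal carry
        s = p ^ q ^ carry
        carry = (p & q) | (p & carry) | (q & carry)
        return s

    return step


def tree_sum(streams):
    """Sum eight equal-length bit streams (lsb first) through three adder ranks."""
    upper = [serial_adder() for _ in range(4)]
    middle = [serial_adder() for _ in range(2)]
    lower = serial_adder()
    out = []
    for bits in zip(*streams):  # one tuple of eight input bits per clock cycle
        u = [upper[i](bits[2 * i], bits[2 * i + 1]) for i in range(4)]
        m = [middle[i](u[2 * i], u[2 * i + 1]) for i in range(2)]
        out.append(lower(m[0], m[1]))
    return out


def to_stream(value, width):
    """Split a non-negative integer into `width` bits, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


values = [3, 5, 7, 1, 0, 6, 2, 4]               # arbitrary illustrative inputs
bits = tree_sum([to_stream(v, 6) for v in values])
print(sum(b << i for i, b in enumerate(bits)))  # 28
```

Each adder retains only a single carry bit between cycles, so the streams must be long enough to allow for word growth, as with the word-growth zeros mentioned above.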

Referring now also to FIG. 12, there is shown a circuit for the two-way switches 112 of FIG. 10, parts previously mentioned having like references. Each switch 112 comprises FET switches 120 and 121 for switching the output of cell 71 on line 122 to an adder 108 of tree O or X via line 123 or 124 respectively. The FET switches 120 and 121 are controlled by the Q̄ and Q outputs 126 and 127 of a D-type flip-flop having a D or data input 129, a clock input 130 and a reset line 131. The Q̄ output 126 is connected to the D input 129. The array 70 is provided with two control lines 132_(O) and 132_(X) having respective latches 133_(O) and 133_(X), these not being shown in FIG. 7. The control lines 132_(O) and 132_(X) are arranged parallel to and immediately below the bottom row 76 of the array 70, and are connected to respective inputs 134 of a respective Exclusive-OR gate 135 for each switch 112.

The circuit of FIG. 12 operates as follows. As has been mentioned with reference to FIGS. 7 and 10, the two-way switches 112 are required to switch the outputs of cells 71 in the bottom row of the array 70 from one adder tree to another when the boundary of an interaction region has moved across it. Such boundaries are also the boundaries of rightward and leftward moving data parallelograms. Switching occurs in synchronism with the input of the first or least significant bit of each bit-serial data word a_(ik) or b_(kj) to a cell 71. To effect switching, each of control lines 132_(O) and 132_(X) carries a pulse train of 1s each followed by seven 0s progressing to the right and left respectively, each 1 pulse being synchronised with the passage of an lsb of a respective rightward or leftward moving data parallelogram through the bottom row 76 of cells 71. The circuit of FIG. 12 is initialised by operation of the reset line 131, which sets the Q output 127 to zero and the Q̄ output 126 and D input 129 to one. This switches transistors 120 and 121 on and off respectively, and connects cell 71 to adder tree O. At this initial moment, data parallelograms have yet to interact. As the parallelograms counter-propagate through the array 70, eventually 1 pulses on lines 132_(O) and 132_(X) will be clocked into and out of latches 133_(O) and 133_(X). The EX-OR gate 135 provides a 1 output if either (but not both) of its inputs is 1, indicating that the boundary of one data parallelogram (only) has crossed bottom row cell 71. This EX-OR output 1 appears on the flip-flop clock input 130, and clocks the initial D input value of 1 to the Q output 127, Q̄ and D becoming 0. Transistors 120 and 121 switch to off and on respectively, connecting cell 71 to adder tree X instead of adder tree O as initially. Subsequent control line pulses switch the cell 71 back and forth between adder trees in synchronism with the bottom row lsb of data parallelograms, as required to synchronise adder tree switching with interaction region or product matrix coefficient movement. The EX-OR gate 135 does not switch if both its inputs are 1, since this corresponds to a cell (e.g. 71₆ or 71₁₀) positioned in the middle of successive interaction regions which does not switch between adder trees.
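The switching behaviour described for FIG. 12 may be modelled by the following Python sketch. It is illustrative only and rests on assumptions not stated in the specification: the flip-flop is taken to be rising-edge triggered, its complementary output is taken to be fed back to the D input so that each clock edge toggles it, and tree O is taken to be selected while the complementary output is 1. The class name TreeSwitch and its members are hypothetical.

```python
class TreeSwitch:
    """Toggle flip-flop clocked by the EX-OR of the two control line pulses."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Model the reset line: Q = 0, so the complement (and D) = 1, tree O."""
        self.q = 0
        self._prev_clock = 0

    @property
    def q_bar(self):
        """Complementary output, assumed to be fed back to the D input."""
        return 1 - self.q

    def step(self, pulse_o, pulse_x):
        """Apply one clock cycle of control line values; return the tree selected."""
        clock = pulse_o ^ pulse_x                   # EX-OR gate
        if clock == 1 and self._prev_clock == 0:    # assumed rising-edge trigger
            self.q = self.q_bar                     # D is clocked to Q: toggle
        self._prev_clock = clock
        return "O" if self.q_bar == 1 else "X"


switch = TreeSwitch()
print(switch.step(0, 0))  # O  (initial state after reset)
print(switch.step(1, 0))  # X  (one boundary has crossed the cell)
print(switch.step(0, 0))  # X
print(switch.step(0, 1))  # O  (the other boundary has crossed)
print(switch.step(1, 1))  # O  (both pulses together: no switching)
```

The last call corresponds to a mid-region cell, for which both pulses arrive together and the EX-OR output stays at 0.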

From the moment of initialisation referred to above, there are periods during which adder tree outputs are ignored. These periods are the times taken for data parallelograms to move across the array and interact, and for the resulting interaction regions to move out to the lower rank adders 108. The relevant time will be longer the further the corresponding interaction region is from the middle of the array 70. Provision for ignoring results during such settling periods is well understood in the art.

The foregoing description of the invention has dealt with data input with bits interspersed with zeros. While both functional and convenient for explanatory purposes, it incorporates 50% cell redundancy in that half the processing cells at any time are evaluating products involving interspersed zeros. This redundancy is capable of reduction. Referring now to FIG. 13, there are shown two interaction regions 140 and 141. Region 140 is five cells wide, diamond shaped and equivalent to interaction regions described previously, and includes data bits interspersed with zeros indicated by Xs and Os respectively. Region 141 is a form of region 140 which has been contracted such that data bits X move to the right sufficiently to remove interspersed zeros. This results in three columns of one, five and three data bits X in the region 141 without interspersed zeros.

Referring now to FIG. 14, there are shown A and B matrix data regions 150 (narrow lines) and 151 (broad lines) interacting to form a C product matrix region 152 (chain lines) in a part 153 of a larger array (not shown), where A and B are 6×6 matrices of elements three bits wide. FIG. 14 is similar to FIGS. 2 to 4 but with regions 150 and 152 contracted. The consequence of contraction or removal of interspersed zeros is indicated by the multilateral shapes of regions 150 to 152. As shown, and with terms as previously defined, bit-level matrix elements a_(1k)^(n) and b_(k1)^(n) are in the process of interacting to form product matrix element c₁₁ (cf FIGS. 2 to 4). To compensate for the lack of interspersed zeros, the movement of rows of regions 150 and 151 becomes more complex than that described for earlier examples. The movement is as follows. On odd numbered array processing cycles, elements a_(1k)^(n) in odd numbered rows of region 150 (i.e. k=1, 3 and 5) move one cell to the left, and elements b_(k1)^(n) in even numbered rows (k=2, 4 and 6) move one cell to the right. All other matrix elements stay fixed. On even numbered cycles elements a_(1k)^(n) for which k=2, 4 and 6 move one cell to the left and elements b_(k1)^(n) for which k=1, 3 and 5 move one cell to the right. The analysis of this movement is similar to that given for FIGS. 2 to 4, 7 and 10, and shows that product matrix terms are accumulated in the appropriate manner. The analysis also extends naturally to a full array (cf FIGS. 6 and 7), since interaction regions equivalent to region 152 nest together in a manner similar to diamond-shaped regions. The effect is that the number of processing cells in an array is reduced by half, with a corresponding reduction in the number of adders required in the accumulating means.

This effect is achieved as has been described by data input means arranged to move alternate rows of one input matrix while columns of the other input matrix interacting therewith remain stationary, and to keep stationary the remaining rows of that input matrix while alternate columns of the other matrix move. This procedure is repeated on alternate array cycles such that each row or column of a respective input matrix experiences alternate movement and stationary cycles in anti-phase with the two respective adjacent matrix rows or columns. Clocking means for achieving this are well known in the art. To effect alternating movement of data bits, each individual processing cell 81 or 94 (FIG. 8 or 9) is controlled in a slightly different manner. Latches 84 and 85 for a and b data bits respectively are clocked on alternate cycles, whereas vertical input latch 87, carry latch 89 and control latch 96 are clocked every cycle.
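The alternating movement rule set out above for the contracted array of FIG. 14 reduces to a parity test, as the following Python sketch illustrates. It encodes only the odd and even cycle rules stated in the text; the function name row_motion is hypothetical.

```python
def row_motion(cycle, k):
    """Return (a_moves, b_moves) for row index k of the array on a given cycle.

    A elements in row k move left when the cycle number and k have the same
    parity; B elements in row k move right when the parities differ.
    """
    same_parity = (cycle % 2) == (k % 2)
    return (same_parity, not same_parity)


for cycle in (1, 2):
    for k in range(1, 7):
        a_moves, b_moves = row_motion(cycle, k)
        print(cycle, k, "A moves" if a_moves else "B moves")
```

Exactly one of the two counter-propagating data streams in any row moves on each cycle, so advanced elements always meet stationary ones.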

A form of time stagger is applied to data input to the array incorporating the part 153, the stagger being more complex than that described with reference to FIGS. 2 to 4. Matrix elements in alternate rows are time delayed by one cycle or cell as compared to the rows above. As matrix elements of A in each odd-numbered row move under the action of clocking means, they establish a one-cell time stagger over the respective adjacent elements immediately below, and this stagger is removed on the next cycle when matrix elements of A in even numbered rows move. The equivalent takes place in antiphase for matrix elements of B. This may be achieved by providing for matrix element input via one latch in series with the third and fourth rows and two latches in series with the fifth and sixth rows. Larger arrays would be provided with series input latches increasing by one every second row.
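The number of series input latches per row stated above follows a simple rule, sketched below in Python. The extrapolation to rows beyond the sixth follows the statement that latches increase by one every second row; the function name input_latches is hypothetical.

```python
def input_latches(row):
    """Series input latches for a 1-indexed row: 0, 0, 1, 1, 2, 2, 3, 3, ..."""
    return (row - 1) // 2


print([input_latches(row) for row in range(1, 9)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```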

We claim:
1. A digital data processor for multiplying row elements of a first input matrix by column elements of a second input matrix to form a product matrix, the processor including: a systolic array of processing cells arranged for bit-level multiplication; data input means for effecting bit-level matrix element input to the array for multiplication of each bit of each first matrix row element by each bit of each of the second matrix elements in a respective column; wherein the data input means is arranged both for matrix input with zeros interspersed between adjacent bits of each matrix element and for producing a cumulative time stagger between input of adjacent rows of the first matrix and adjacent columns of the second matrix, accumulating means for summing array output contributions to each bit of each product matrix element; and clocking means for controlling operation of the processing cells, data input means and accumulating means, the clocking means being arranged to advance both input matrices by one cell through the array on each clock cycle, wherein the processing cells are nearest-neighbour connected and have processing functions to evaluate the product of two input data bits, add the product to an input cumulative sum bit and a carry bit from a next lower order bit computation, output a corresponding result, generate a new carry bit and pass on the input data bits to respective neighbouring cells.
2. A digital data processor according to claim 1 wherein the data input means is arranged both for matrix input without zeros interspersed between adjacent matrix element bits and for producing a cumulative time stagger between input of alternate rows of the first matrix and alternate columns of the second matrix, and wherein the clocking means is arranged for advancing on alternate clock cycles both adjacent rows of the first matrix and adjacent columns of the second matrix such that advanced input matrix elements are multiplied by non-advanced matrix elements on each cycle.
3. A digital data processor for multiplying row elements of a first input matrix by column elements of a second input matrix to form a product matrix, the processor including: a systolic array of processing cells arranged for bit-level multiplication; data input means arranged to effect bit-level matrix element input to the array for multiplication of each bit of each first matrix row element by each bit of each of the second matrix elements in a respective column; accumulating means arranged to sum array output contributions to each bit of each product matrix element; the accumulating means including pairs of mutually intercalated adder trees arranged to sum array output contributions to each bit of each product matrix element; and clocking means to control operation of the processing cells, data input means and accumulating means.
4. A digital data processor according to claim 3 including switching means arranged to switch individual array outputs from one respective adder tree to another.
5. A digital data processor according to claim 4 including control means arranged to actuate the switching means in conjunction with entry of a leading bit of a matrix element into an array output cell.
6. A digital data processor according to claim 5 wherein the control means includes two control lines each containing a respective latch associated with each array output cell, means for generating control pulses for propagation along the control lines in synchronism with array throughput of leading matrix element bits, and subsidiary switching means for switching individual array outputs from one respective adder tree to another in response to actuation by control pulses.
7. A digital data processor for multiplying row elements of a first input matrix by column elements of a second input matrix to form product matrix elements, each matrix element being a digital word, comprising: an array of bit-level logic cells arranged in rows and columns, wherein each logic cell is arranged to: (a) input two matrix element bits together with carry and cumulative sum bits, (b) compute output cumulative sum and carry bits corresponding to addition of the input cumulative sum and carry bits to the product of the input matrix element bits, (c) output both matrix element bits and the output cumulative sum bit, and (d) recirculate the output carry bit on the respective cell to provide an input carry bit to a succeeding computation; nearest neighbour cell interconnection means for allowing matrix element bit movement along array rows and for allowing cumulative sum generation to be cascaded down array columns, the interconnection means including clock-activated latch means for bit storage and advance; wherein each row of the array is arranged to receive a respective first matrix row of elements and a respective second matrix column of elements input to mutually opposite row ends and disposed both bit and word serially with least significant bits leading; wherein the latch means are also for clock activation both to advance first matrix row elements and second matrix column elements in counterflow along array rows and to cascade cumulative sum generation down array columns; and wherein the array columns have final logic cells, each having a cumulative sum output for association with those of respective nearby cells, and further comprising a plurality of switching means for switching said cumulative sum outputs; and first and second accumulating means, arranged to add the respective cumulative sum output to those of nearby final cells of other columns at least predominantly to one side of one of said cells as selected by each switching means, each switching means being arranged to alternate cell output summing between the first and second accumulating means as appropriate to isolate and sum contributions to different product matrix elements.
8. A digital data processor according to claim 7 wherein each of the first and second accumulating means comprises an adder tree, and each switching means includes control means arranged to actuate switching in synchronism with an input of a leading bit of a matrix element to the respective array column final cell associated therewith.
9. A digital data processor according to claim 8 where the control means includes two control lines, each containing a respective latch associated with each array column final cell, the control lines being arranged for propagation of control pulses in synchronism with array throughput of leading matrix element bits, and wherein each switching means is arranged to operate in response to receipt of each of the said control pulses.